Abstract

A statistical language model assigns probability to strings of arbitrary length. 
Unfortunately, it is not possible to gather reliable statistics on strings of arbi- 
trary length from a finite corpus. Therefore, a statistical language model must 
decide that each symbol in a string depends on at most a small, finite num- 
ber of other symbols in the string. In this report we propose a new way to 
model conditional independence in Markov models. The central feature of our 
nonuniform Markov model is that it makes predictions of varying lengths us- 
ing contexts of varying lengths. Experiments on the Wall Street Journal reveal 
that the nonuniform model performs slightly better than the classic interpo- 
lated Markov model. This result is somewhat remarkable because both models 
contain identical numbers of parameters whose values are estimated in a similar 
manner. The only difference between the two models is how they combine the 
statistics of longer and shorter strings. 

Keywords: nonuniform Markov model, interpolated Markov model, condi- 
tional independence, statistical language model, discrete time series. 
1 Introduction



The task of statistical language modeling is to accurately predict the future 
utterances of a language user. The probability that a given language user will 
produce a given utterance at a given moment depends on the language user's 
knowledge of language and of the world. Our current understanding of the 
language user's cognitive abilities is too impoverished for us to build plausible 
models of the language user's knowledge, and so we must be content to model 
the observables as best we can. Here the observables are the word sequences 
produced by language users. And so our goal is to assign accurate probabilities 
to word sequences. 

The interpolated Markov model [?] and its cousin the backoff model [?], [?], [18]
have long been the workhorses of the statistical language modeling community.
These traditional models rely only on the frequencies of strings up to a fixed
length. Recent research in statistical language modeling has focused primarily
on developing more powerful model classes as well as on adding new
sources of information to the traditional models [?]. In contrast, the goal
of this work is to find a more effective way to use the statistics of finite length
strings. The distinguishing feature of our model is that it acquires beliefs about
conditional independence, and uses those beliefs to make predictions of varying
lengths using contexts of varying lengths.

We believe that our work has two contributions to offer to the field of Markov
modeling. The first contribution is our interpretation of the interpolation
parameters as beliefs about conditional independence. Prior work on interpolated
Markov models has interpreted the interpolation parameters as smoothing the
"specific probabilities" with the "general probabilities" [?]. Our interpretation
gives rise to the second contribution of our work, namely, a class of
nonuniform Markov models that make predictions of varying lengths using
contexts of varying lengths. Nonuniform prediction is a principled way to perform
alphabet extension, that is, to make a string become a symbol in the alphabet,
an ad hoc technique that can improve model performance [?].

The remainder of this report is organized into four sections. In section 2 we
motivate the nonuniform model as arising from the proper generative
interpretation of our beliefs about conditional independence. In section 3 we provide
efficient algorithms for evaluating the probability of a string according to a
nonuniform model, for finding the most likely nonuniform generation path for
a given string, and for optimizing the parameters of a nonuniform model on
a training corpus. Finally, in section 4 we compare the performance of the
classic interpolated Markov model and the nonuniform model on the Wall Street
Journal. The nonuniform model performs slightly better than the classic model
under equivalent experimental conditions. This result is somewhat remarkable,
since the only difference between these two models is how they interpret the
interpolation parameters.
2 Nonuniform Model



A statistical language model assigns probability to strings of arbitrary length.
Unfortunately, it is not possible to gather reliable statistics on strings of
arbitrary length from a finite corpus. In practice, this difficulty is quite severe.
There are $k^n$ logically possible strings of length $n$ over an alphabet of size $k$, but
there are at most $T - n + 1$ distinct strings of length $n$ in a corpus of length $T$.
Nearly all of the n-grams do not occur in any finite corpus, and of the n-grams
that do occur, nearly all occur only once. Therefore, we must decide that each
symbol in a string depends only on at most a small, finite number of other
symbols and is conditionally independent of all other symbols in the string.
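To give a sense of scale, the following Python sketch computes the upper bound $(T - n + 1)/k^n$ on the fraction of possible n-grams that a corpus of length $T$ can contain. The function name is ours, and the values of $k$ and $T$ in the loop are illustrative round numbers, not measurements.

```python
def ngram_coverage_bound(k: int, n: int, T: int) -> float:
    """Upper bound on the fraction of the k**n possible n-grams
    that can occur in a corpus of T symbols."""
    possible = k ** n                 # logically possible strings of length n
    observable = max(T - n + 1, 0)    # at most this many distinct n-grams occur
    return min(1.0, observable / possible)


# Illustrative values only: a 20,000 word alphabet and a 40 million word corpus.
for n in (1, 2, 3, 4):
    print(n, ngram_coverage_bound(20_000, n, 40_000_000))
```

Even under these generous assumptions the bound falls off by a factor of $k$ with each additional symbol of context, which is why a model must commit to some form of conditional independence.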

For example, a Markov model of order $n$ stipulates that each symbol depends
only on the $n$ most recent symbols, and is conditionally independent of all other
past symbols,

$$p(x_i \mid x_1 \ldots x_{i-1}) = p(x_i \mid x_{i-n} \ldots x_{i-1})$$

where the probability $p(x^T \mid T)$ of a string $x^T$ of length $T$ is then calculated as a
product of $T$ conditional probabilities.

$$p(x^T \mid T) = \prod_{i=1}^{T} p(x_i \mid x_1 \ldots x_{i-1}, T) = \prod_{i=1}^{T} p(x_i \mid x_{i-n} \ldots x_{i-1}, T)$$

We are trying to model the observable correlates of a cognitive process far 
more complex and powerful than a fixed order Markov model. Consequently, 
we cannot afford to take such a simple-minded approach to conditional indepen- 
dence. Rather than stipulate the point of conditional independence a priori, as 
in a Markov model, we would like our model to acquire beliefs about conditional 
independence based on empirical evidence. 

In this section, we provide three different generative interpretations for the 
state-conditional interpolation parameters of a Markov model. These interpre- 
tations give rise to an interpolated context model, an interpolated state model, 
and our nonuniform model. Next, we compare the ability of these three in- 
terpretations to model local independence and global independence. We argue 
that the nonuniform model combines the ability of the state model to properly 
model global independence with the ability of the context model to properly 
model local independence. Finally, we prove that the nonuniform model is
fundamentally different from the other two models because it is not possible to map
a nonuniform model into an extensionally equivalent context or state model.

Let us first define our notation. Let $A$ be a finite alphabet of distinct symbols,
$|A| = k$, and let $x^T \in A^T$ denote an arbitrary string of length $T$ over the
alphabet $A$. Then $x_i^j$ denotes the substring of $x^T$ that begins at position $i$ and
ends at position $j$. For convenience, we abbreviate the unit length substring $x_i^i$
as $x_i$ and the length $t$ prefix of $x^T$ as $x^t$.
2.1 Three Interpolated Models

An interpolated Markov model $\phi = (n, A, \delta, \lambda)$ consists of a maximal string
length $n$, a finite alphabet $A$, a set of string probabilities $\delta : A^{\le n} \to [0,1]$,
and the interpolation parameters $\lambda : A^{<n} \to [0,1]$. Given a string $y^l$, $l \le n$,
the string probabilities $\delta(y^l)$ are typically their empirical probabilities in a
training corpus. The only difference between our three models will be how the
interpolation parameters $\lambda$ are interpreted.

Let us now consider three generative interpretations of the interpolated
Markov model: the context model, the state model, and our nonuniform model.
A context model interprets the $\lambda$ parameters as combining the predictions from
Markov models of varying orders. A state model interprets the $\lambda$ parameters as
hidden transitions from a higher order Markov model to a lower order Markov
model. The state and context models are both uniform models because they
always predict unit-length strings. A nonuniform model interprets the $\lambda$
parameters as beliefs about conditional independence.

In each case, we let $p_c(i \mid x^t_{t-m+1})$ be the probability that we pick a context
of length $i$ in the history $x^t_{t-m+1}$ and let $p_v(y^j \mid x^t_{t-i+1})$ be the probability that
we make a prediction $y^j$ of length $j$ in the chosen context $x^t_{t-i+1}$.

2.1.1 Context Model 

In the interpolated context model, the interpolation parameters are understood
as smoothing the conditional probabilities estimated from longer histories with
those estimated from shorter histories [?], [13]. Longer histories support stronger
predictions, while shorter histories have more accurate statistics. Interpolating
the predictions from histories of different lengths results in more accurate
predictions than can be obtained from any fixed history length. This interpretation
of the interpolation parameters was originally proposed by Jelinek and Mercer
[?]. It leads to the following generation algorithm, where the hidden transition
from a longer context to a shorter context (line 3) is temporary, used only for
the current prediction (line 4).

CONTEXT-GENERATE($T$, $\phi$)

1. Initialize $t := 0$; $x^0 := \epsilon$;

2. Until $t = T$

3.   Pick context length $i$ in $[0, \min(t, n-1)]$

       $p_c(i \mid x^t) = \lambda(x^t_{t-i+1}) \prod_{l=i+1}^{\min(t,n-1)} (1 - \lambda(x^t_{t-l+1}))$

4.   Make one symbol prediction $y^1$

       $p_v(y^1 \mid x^t_{t-i+1}) = \delta(y^1 \mid x^t_{t-i+1}, i+1)$

5.   Extend history $x^t$ by prediction $y^1$

       $x^{t+1} := x^t y^1$; $t := t + 1$;

6. return($x^T$);
The probability $p_c(x_i \mid x^{i-1}, \phi)$ assigned by an interpolated context model $\phi$
to a symbol $x_i$ in the history $x^{i-1}$ has a particularly simple form [?],

$$p_c(x_i \mid x^{i-1}, \phi) = \lambda(x^{i-1})\,\delta(x_i \mid x^{i-1}) + (1 - \lambda(x^{i-1}))\,p_c(x_i \mid x_2^{i-1}, \phi) \qquad (1)$$

where $\lambda(x^i) = 0$ for $i \ge n$ and $\lambda(\epsilon) = 1$.
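The recursion in equation (1) unrolls into a loop over ever shorter suffixes of the history. The following Python sketch shows one way to do this under our own assumptions: the interpolation parameters are stored in a dictionary `lam` keyed by context tuples (missing contexts count as zero), and `delta(sym, ctx)` is a caller-supplied conditional probability. The toy bigram values are invented purely for illustration.

```python
from typing import Callable, Dict, Sequence, Tuple

Context = Tuple[str, ...]


def context_model_prob(sym: str, history: Sequence[str], n: int,
                       lam: Dict[Context, float],
                       delta: Callable[[str, Context], float]) -> float:
    """Equation (1): interpolate delta(sym | ctx) over shorter and shorter contexts."""
    # Contexts of length >= n have lambda = 0, so only the last n - 1 symbols matter.
    keep = min(len(history), n - 1)
    ctx: Context = tuple(history[len(history) - keep:])
    prob = 0.0
    reach = 1.0                       # probability of having skipped all longer contexts
    while ctx:
        w = lam.get(ctx, 0.0)
        prob += reach * w * delta(sym, ctx)
        reach *= 1.0 - w
        ctx = ctx[1:]                 # drop the oldest symbol
    return prob + reach * delta(sym, ())   # lambda(empty context) = 1


# Toy bigram model (hypothetical numbers).
unigram = {"a": 0.6, "b": 0.4}
bigram = {("a",): {"a": 0.9, "b": 0.1}, ("b",): {"a": 0.3, "b": 0.7}}
toy_lam = {("a",): 0.5, ("b",): 0.2}


def toy_delta(sym: str, ctx: Context) -> float:
    return bigram[ctx][sym] if ctx else unigram[sym]


print(context_model_prob("b", ["a"], 2, toy_lam, toy_delta))  # 0.5*0.1 + 0.5*0.4
```

Because each step either uses the current context or passes the remaining mass to a shorter one, the resulting values sum to one over the alphabet whenever each `delta(. | ctx)` does.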

2.1.2 State Model 

Alternately, the interpolation parameters may be understood as modeling our 
beliefs about how much of the past is necessary to predict a state transition in 
an underlying Markov source of unknown order. This interpretation leads to 
the following generation algorithm, where the hidden transition from a state of 
a higher order model to a state of a lower order model (line 3) is permanent 
(line 4). 

STATE-GENERATE($T$, $\phi$)

1. Initialize $t := 0$; $x^0 := \epsilon$; $m := 0$;

2. Until $t = T$

3.   Pick context length $i$ in $[0, m]$

       $p_c(i \mid x^t_{t-m+1}) = \lambda(x^t_{t-i+1}) \prod_{l=i+1}^{m} (1 - \lambda(x^t_{t-l+1}))$

4.   $m := i$;

5.   Make one symbol prediction $y^1$

       $p_v(y^1 \mid x^t_{t-i+1}) = \delta(y^1 \mid x^t_{t-i+1}, i+1)$

6.   Extend history $x^t$ by prediction $y^1$

       $x^{t+1} := x^t y^1$; $t := t + 1$; $m := \min(m + 1, n - 1)$;

7. return($x^T$);



2.1.3 Nonuniform Model 

We develop the following model of conditional independence. Let $\iota(x^n)$ be our
degree of belief that $x_n$ depends on $x_1$ in a string $x^n$ of length $n$

$$\iota(x^n) = p\big(p(x_n \mid x_1 \ldots x_{n-1}) \ne p(x_n \mid x_2 \ldots x_{n-1})\big)$$

and let $\lambda(x^i)$ be our degree of belief that the next $n - i$ symbols depend on $x_1$,
a kind of expected dependence.

Our beliefs about independence are determined in large part by the robustness
of our statistics. If we do not believe that our model $\delta(\cdot \mid x^i)$ of the source state
transition probabilities is accurate, then our $\lambda(x^i)$ will be low.
Our beliefs about conditional independence have two implications. The first
implication, as in the uniform model, is that we should transition from a longer
context $x^i$ to the shorter context $x_2^i$ with probability $1 - \lambda(x^i)$. This expresses
our belief of degree $1 - \lambda(x^i)$ that the future does not depend on $x_1$. The
second implication, which is unique to the nonuniform model, is that we should
transition from a shorter prediction $y^{j-1}$ to a longer prediction $y^j$ in the chosen
context $x^i$ with probability $\lambda(x^i y^{j-1})$. This implication follows from our belief
of degree $\lambda(x^i y^{j-1})$ that the future depends on the entire string $x^i y^{j-1}$ and does
not depend on any symbol further in the past. Our novel interpretation leads
to the following nonuniform generation algorithm.

NONUNIFORM-GENERATE($T$, $\phi$)

1. Initialize $t := 0$; $x^0 := \epsilon$;

2. Until $t = T$

3.   Pick context length $i$ in $[0, \min(t, n-1)]$

       $p_c(i \mid x^t) = \lambda(x^t_{t-i+1}) \prod_{l=i+1}^{\min(t,n-1)} (1 - \lambda(x^t_{t-l+1}))$

4.   $c := x^t_{t-i+1}$; $j_{\max} := \min(n - i, T - t)$;

5.   Pick prediction $y^j$ of length $j$ in $[1, j_{\max}]$

       $p_v(y^j \mid c) = (1 - \lambda(c y^j))\,\delta(y_j \mid c y^{j-1}, i + j) \prod_{l=1}^{j-1} \lambda(c y^l)\,\delta(y_l \mid c y^{l-1}, i + l)$

       where $\lambda(c y^{j_{\max}}) = 0$.

6.   Extend history $x^t$ by prediction $y^j$

       $x^{t+j} := x^t y^j$; $t := t + j$;

7. return($x^T$);

The nonuniform model behaves both like a state model and like a context 
model. The transition from a longer context to a shorter context (line 3) con- 
tinues for the duration of the resulting prediction (line 5). If a unit length 
prediction is made, then the nonuniform model behaves exactly like the context 
model. However, if a longer prediction is made, then the nonuniform model 
behaves more like the state model. 
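The generation procedure can be sketched as a sampler. The Python below is a minimal illustration under our own assumptions: `lam` is a dictionary from context tuples to interpolation parameters (missing entries count as zero), `delta(sym, ctx)` returns a strictly positive conditional probability, and the context is chosen by trying the longest suffix first and accepting it with probability $\lambda$, which yields the product form of line 3.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Context = Tuple[str, ...]


def nonuniform_generate(T: int, n: int, alphabet: Sequence[str],
                        lam: Dict[Context, float],
                        delta: Callable[[str, Context], float],
                        rng: random.Random) -> List[str]:
    """Sample a string of length T from a nonuniform model (sketch)."""
    x: List[str] = []
    while len(x) < T:
        t = len(x)
        # Line 3: try contexts from longest to shortest; the empty context always accepts.
        i = min(t, n - 1)
        while i > 0 and rng.random() >= lam.get(tuple(x[t - i:]), 0.0):
            i -= 1
        c: Context = tuple(x[t - i:]) if i > 0 else ()
        j_max = min(n - i, T - t)                 # line 4
        y: List[str] = []
        while True:                               # line 5
            ctx = c + tuple(y)
            weights = [delta(a, ctx) for a in alphabet]
            y.append(rng.choices(list(alphabet), weights=weights)[0])
            # Extend the prediction with probability lambda(c y); forced stop at j_max.
            if len(y) == j_max or rng.random() >= lam.get(c + tuple(y), 0.0):
                break
        x.extend(y)                               # line 6
    return x


# Hypothetical toy model over a two-letter alphabet.
demo_lam = {("a",): 0.5, ("b",): 0.5, ("a", "a"): 0.8}
sample = nonuniform_generate(12, 3, ["a", "b"], demo_lam,
                             lambda sym, ctx: 0.5, random.Random(0))
print("".join(sample))
```

When every extension parameter is zero, each pass through the inner loop emits exactly one symbol and the sampler reduces to the context model, matching the remark above about unit length predictions.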

2.2 Two Situations 

Let us examine the behavior of our three model classes in two situations. The 
first situation is a point of local independence, where the current prediction 
does not depend on the history but later predictions do. In such a situation, the 
context model will outperform the state model. The second situation is a point 
of global independence, where no subsequent prediction depends on the current 
history. In such a situation, the state model will outperform the context model. 
The nonuniform model will perform reasonably well in both situations. 

The first situation to consider is a point of local independence, where the
immediate future $y_1$ does not depend on any suffix of the history $x^t$, while the
longer term future $y_2$ depends on the entire past $x^t y_1$. In such a situation, all
$\lambda(x_i^t)$ will be close to zero, while the $\lambda(x_i^t y)$ will be close to unity. Consequently,
the context model will accurately predict $p(\cdot \mid x^t)$ using the empty context $\epsilon$ and
then predict $p(\cdot \mid x^t y)$ using the full context $x^t y$. In contrast, the state model
will transition from the $x^t$ context all the way to the empty context $\epsilon$ with high
probability, which then obliges it to predict $p(\cdot \mid x^t y)$ using the weak context $y$.
The behavior of the nonuniform model depends on the value of $\lambda(y)$. If $\lambda(y)$ is
high, then the nonuniform model will behave more like the state model, while
if $\lambda(y)$ is low, then it will behave more like the context model.

$$p_c(x^3 \mid 3, \phi) = \delta(x_1) \big[ \lambda(x_1)\delta(x_2 \mid x_1) + (1-\lambda(x_1))\delta(x_2) \big] \big[ \lambda(x^2)\delta(x_3 \mid x^2) + (1-\lambda(x^2)) [ \lambda(x_2)\delta(x_3 \mid x_2) + (1-\lambda(x_2))\delta(x_3) ] \big] \qquad (2)$$

$$\begin{aligned}
p_s(x^3 \mid 3, \phi) = \delta(x_1) \Big[ & \lambda(x_1)\delta(x_2 \mid x_1) \big[ \lambda(x^2)\delta(x_3 \mid x^2) + (1-\lambda(x^2)) [ (1-\lambda(x_2))\delta(x_3) + \lambda(x_2)\delta(x_3 \mid x_2) ] \big] \\
& + (1-\lambda(x_1))\delta(x_2) \big[ \lambda(x_2)\delta(x_3 \mid x_2) + (1-\lambda(x_2))\delta(x_3) \big] \Big]
\end{aligned} \qquad (3)$$

$$\begin{aligned}
p_n(x^3 \mid 3, \phi) = \delta(x_1) \Big[ & \lambda(x_1)\delta(x_2 \mid x_1) \big[ (1-\lambda(x^2)) \big[ \lambda(x^2)\delta(x_3 \mid x^2) + (1-\lambda(x^2)) [ (1-\lambda(x_2))\delta(x_3) + \lambda(x_2)\delta(x_3 \mid x_2) ] \big] + \lambda(x^2)\delta(x_3 \mid x^2) \big] \\
& + (1-\lambda(x_1))\delta(x_2) \big[ (1-\lambda(x_2)) \big[ \lambda(x^2)\delta(x_3 \mid x^2) + (1-\lambda(x^2)) [ \lambda(x_2)\delta(x_3 \mid x_2) + (1-\lambda(x_2))\delta(x_3) ] \big] + \lambda(x_2)\delta(x_3 \mid x_2) \big] \Big]
\end{aligned} \qquad (4)$$

Figure 1: The total probability assigned to a string $x^3$ by the three generative
interpretations of the interpolated trigram model. The context model is shown
in (2), the state model in (3), and the nonuniform model in (4).

The simplest example of such a situation is an interpolated trigram model
on a string $x^3$ of length 3, where $p(\cdot \mid x_1) = p(\cdot \mid \epsilon)$ but $p(\cdot \mid x^2) \ne p(\cdot \mid x_2)$ and
$p(\cdot \mid x_2) = p(\cdot \mid \epsilon)$. Then $\lambda(x_1)$ and $\lambda(x_2)$ are close to zero, while $\lambda(x^2)$ is close to
unity. Consequently, the state model must incorrectly treat all three symbols as
being independent (5a), while the context model (5b) and the nonuniform model
(5c) are able to correctly treat $x_2$ as independent of $x_1$, while also treating $x_3$
as dependent on both $x_1$ and $x_2$.

a. $p_s(x^3 \mid \phi) \approx \delta(x_1)\delta(x_2)\delta(x_3)$

b. $p_c(x^3 \mid \phi) \approx \delta(x_1)\delta(x_2)\delta(x_3 \mid x^2)$ (5)

c. $p_n(x^3 \mid \phi) \approx \delta(x_1)\delta(x_2)\delta(x_3 \mid x^2)$

The total probability assigned to a string $x^3$ by our three interpolated trigram
models appears in figure 1.
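The approximations in (5) can be checked numerically against the closed forms of figure 1. The Python sketch below evaluates the three trigram totals for a single string; the argument names and all numeric values are our own illustrative choices, with $\lambda(x_1)$ and $\lambda(x_2)$ set near zero and $\lambda(x^2)$ near one.

```python
def trigram_totals(d1, d2, d2_1, d3, d3_2, d3_12, l1, l2, l12):
    """Return (context, state, nonuniform) probabilities of x^3.

    d1 = delta(x1), d2 = delta(x2), d2_1 = delta(x2|x1), d3 = delta(x3),
    d3_2 = delta(x3|x2), d3_12 = delta(x3|x1 x2),
    l1 = lambda(x1), l2 = lambda(x2), l12 = lambda(x1 x2).
    """
    low = l2 * d3_2 + (1 - l2) * d3            # bigram/unigram mixture for x3
    full = l12 * d3_12 + (1 - l12) * low       # full context choice for x3
    p_context = d1 * (l1 * d2_1 + (1 - l1) * d2) * full
    p_state = d1 * (l1 * d2_1 * full + (1 - l1) * d2 * low)
    p_nonuniform = d1 * (
        l1 * d2_1 * ((1 - l12) * full + l12 * d3_12)
        + (1 - l1) * d2 * ((1 - l2) * full + l2 * d3_2)
    )
    return p_context, p_state, p_nonuniform


# Hypothetical point of local independence: x2 ignores x1, x3 needs both.
p_c, p_s, p_n = trigram_totals(d1=0.5, d2=0.4, d2_1=0.4, d3=0.1, d3_2=0.1,
                               d3_12=0.8, l1=0.01, l2=0.01, l12=0.99)
print(p_c, p_s, p_n)
```

With these values the context and nonuniform totals stay close to $\delta(x_1)\delta(x_2)\delta(x_3 \mid x^2) = 0.16$, while the state total stays close to $\delta(x_1)\delta(x_2)\delta(x_3) = 0.02$.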

The second situation to consider is a point of global independence, where the
entire future $y^n$ is completely independent of the past $x^{n-1}$. Such a situation
will arise in practice when all suffixes of the history $x^{n-1}$ are rare, or when
the source satisfies $p(y^n \mid x_i^{n-1}) = p(y^n \mid \epsilon)$ for all $i$. In this situation, we would like to
ignore the entire history $x^{n-1}$ when making our predictions. All $\lambda(x_i^{n-1})$ and
$\lambda(x_i^{n-1} y^{j-1})$ will be close to zero, but never identically zero. Due to inadequate
statistics at a point of independence, nearly all ^(yp 1 will be zero, and 

to simplify the example we assume that all are zero. 

Once the state model transitions to the empty context in order to predict
the first symbol $y_1$, it need never again transition past any suffix of $x^{n-1}$. The
total probability assigned to $p(A^n \mid \epsilon)$ by the state model (6a) is a product of
$n - 1$ probabilities. In contrast, the context model must transition past some
suffix of the history $x^{n-1}$ for each of the next $n - 1$ predictions, and so the
total probability assigned to $p(A^n \mid \epsilon)$ by the context model (6b) is a product of
$n(n - 1)/2$ probabilities. Note that (6b) must be considerably less than (6a).
Here the nonuniform model behaves like the state model by first transitioning
to the empty context and then predicting the string $y^n$ of length $n$, and so the
total probability assigned to $p(A^n \mid \epsilon)$ by the nonuniform model (6c) is a product
of only $2n - 2$ probabilities. Therefore the total probability assigned to $y^n$ by
the nonuniform model is considerably greater than that assigned by the context
model (6b) and only slightly less than that assigned by the state model (6a).



(6) 



The point of these examples has been to illustrate how the nonuniform model 
combines the best characteristics of the state and context models. Like the state 
model, it can effectively ignore a misleading history. And like the context model, 
it does not get tricked by points of local independence. 

2.3 Inequivalence 

The only difference between the three model classes is how they interpret the
$\lambda$ parameters. This raises the question of whether the nonuniform
interpretation has substance, that is, whether every nonuniform model might really be
equivalent to some uniform model. Here we argue that nonuniform models are
fundamentally different from uniform models, because it is not possible to map a
nondegenerate nonuniform model into an extensionally equivalent context model
or state model (theorem 1).

We say an interpolated model is degenerate iff it is equivalent to some simpler
model, that is, equivalent to a model with fewer parameters. Formally, an
interpolated Markov model $\phi = (n, A, \delta, \lambda)$ is degenerate iff either (i) some $\lambda$
value is either 0 or 1 or (ii) some higher order transition probability is equivalent
to a lower order transition probability, i.e., $\delta(x_{i+1} \mid x^i) = \delta(x_{i+1} \mid x_2^i)$ for some $x^i$
in $A^{<n}$.

Theorem 1 For every nondegenerate $\phi = (n, A, \delta, \lambda)$ and $\phi' = (n, A, \delta, \lambda')$,
with $n > 1$, there exist strings $x^i \in A^*$ and $y^j \in A^+$ such that the nonuniform
probability $p_n(y^j \mid x^i, \phi)$ is not equal to the context model probability $p_c(y^j \mid x^i, \phi')$
or the state model probability $p_s(y^j \mid x^i, \phi')$.

Proof. Either $\lambda = \lambda'$ or $\lambda \ne \lambda'$.

Case i. If $\lambda \ne \lambda'$, then $p_n(y^1 \mid x^i, \phi) \ne p_c(y^1 \mid x^i, \phi')$ and $p_n(y^1 \mid x^i, \phi) \ne
p_s(y^1 \mid x^i, \phi')$ for some $x^i \in A^+$ because all three interpretations of $\phi$ are trivially
identical for all one symbol predictions $y^1 \in A$.

Case ii. Otherwise $\lambda = \lambda'$, and then $p_n(y^1 \mid x^i, \phi) = p_c(y^1 \mid x^i, \phi') = p_s(y^1 \mid x^i, \phi')$
for all $x^i$ and $y^1$. However, now it is straightforward to show that $p_n(y^j \mid x^i, \phi) \ne
p_c(y^j \mid x^i, \phi')$ and $p_n(y^j \mid x^i, \phi) \ne p_s(y^j \mid x^i, \phi')$ for some $x^i$ and $y^j \in A^j$ with $j > 1$.
We consider the simplest nondegenerate situation, which is $n = 2$, $j = 2$, and
$i = 0$. This corresponds to a bigram model predicting two symbols using an empty
context. In this situation, the state model and the context model both assign
the same uniform probability $p_u(y^2 \mid \phi)$ to $y^2$. Then

$$p_u(y^2 \mid \phi) = \delta(y_1) \big[ \lambda(y_1)\delta(y_2 \mid y_1) + (1 - \lambda(y_1))\delta(y_2) \big]$$

$$p_n(y^2 \mid \phi) = \delta(y_1) \big[ \lambda(y_1)\delta(y_2 \mid y_1) + (1 - \lambda(y_1)) [ \lambda(y_1)\delta(y_2 \mid y_1) + (1 - \lambda(y_1))\delta(y_2) ] \big]$$

and

$$p_n(y^2 \mid \phi) - p_u(y^2 \mid \phi) = \lambda(y_1)(1 - \lambda(y_1))\,\delta(y_1)\,[\delta(y_2 \mid y_1) - \delta(y_2)]. \qquad (7)$$

By the definition of degeneracy, neither $\lambda(y_1)$, $1 - \lambda(y_1)$, nor $\delta(y_2 \mid y_1) - \delta(y_2)$ can
be zero. By the axioms of probability some $y_1$ must have nonzero probability,
which means that $\delta(y_1)$ must be nonzero for that $y_1$, and therefore equation (7)
must also be nonzero for that $y_1$. □

It is instructive to note that the difference (7) between the uniform and
nonuniform interpretations of a given $\phi$ is proportional to the difference between
the conditional probability $\delta(y_2 \mid y_1)$ and the marginal probability $\delta(y_2)$. If $y_2$
and $y_1$ are truly independent, then with high probability $\delta(y_2 \mid y_1) \approx \delta(y_2)$ in our
training corpus and both interpretations assign essentially the same probability
to $y^2$, regardless of our beliefs about conditional independence. If, however, $y_2$
truly depends on $y_1$ then with high probability $\delta(y_2 \mid y_1) \ne \delta(y_2)$ in our
training corpus and the difference between the context model interpretation and the
nonuniform model interpretation depends principally on our beliefs of
conditional independence. This difference is maximized for $\lambda(y_1) = 0.5$, i.e., when we
are maximally uncertain, and vanishes when $\lambda(y_1)$ approaches 0 or 1, i.e., as our
certainty grows.
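Equation (7) and the claim about its maximum are easy to verify numerically. The short Python sketch below uses function names of our own choosing and arbitrary illustrative probabilities; it compares the two bigram expressions from the proof of theorem 1 with the closed form difference over a grid of $\lambda(y_1)$ values.

```python
def p_uniform(lam, d1, d2, d2_1):
    """p_u(y^2): delta(y1) [lam * delta(y2|y1) + (1 - lam) * delta(y2)]."""
    return d1 * (lam * d2_1 + (1 - lam) * d2)


def p_nonuniform(lam, d1, d2, d2_1):
    """p_n(y^2): a two symbol prediction, or one symbol followed by a fresh context choice."""
    return d1 * (lam * d2_1 + (1 - lam) * (lam * d2_1 + (1 - lam) * d2))


def gap(lam, d1, d2, d2_1):
    """Right-hand side of equation (7)."""
    return lam * (1 - lam) * d1 * (d2_1 - d2)


# Arbitrary illustrative values with delta(y2|y1) > delta(y2).
d1, d2, d2_1 = 0.3, 0.2, 0.6
grid = [k / 100 for k in range(101)]
worst = max(abs(p_nonuniform(l, d1, d2, d2_1) - p_uniform(l, d1, d2, d2_1)
                - gap(l, d1, d2, d2_1)) for l in grid)
print(worst, max(grid, key=lambda l: gap(l, d1, d2, d2_1)))
```

The largest discrepancy is at the level of floating point rounding, and the grid point with the largest gap is 0.5.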



3 Nonuniform Algorithms 

Having defined the class of nonuniform models, and compared them to the two
uniform models, let us now consider how we might effectively use the nonuniform
model class in practice. Here we provide efficient algorithms to evaluate the
probability of a string according to a nonuniform model (section 3.1), to find the
most likely generation path for a string according to a nonuniform model (section
3.2), and to optimize the parameters of a nonuniform model on a training corpus
(section 3.3).



3.1 Evaluation 

The nonuniform model $\phi$ assigns probability to generation paths paired with
the strings that they generate. A string may have more than one generation
path, and so the marginal probability of a string $x^T$ is determined by summing
the joint probabilities over all generation paths $s$.

$$p(x^T \mid \phi, T) = \sum_s p(x^T, s \mid \phi, T)$$

There are only polynomially many generation paths for a given string.

The following dynamic programming algorithm evaluates the probability of
a string $x^T$ of length $T$ in $O(n^2 T)$ time and $O(T)$ space. The space requirements
of the algorithm may be reduced to $O(n)$ at a slight expense in clarity. Note
that $\lambda(x^{t+j_{\max}}_{t-i+1}) = 0$ for $j_{\max} = \min(T - t, n - i)$.
NONUNIFORM-EVALUATE($x^T$, $\phi$)

1. For $t = 2$ to $T$ [ $\alpha_t := 0$; ]; $\alpha_1 := 1$;

2. For $t = 1$ to $T - 1$

3.   $q := 1$;

4.   for $i = \min(t, n - 1)$ down to $0$

5.     $p_c := \lambda(x^t_{t-i+1})\, q$; $p_v := 1$;

6.     for $j = 1$ to $\min(T - t, n - i)$

7.       $p_v := \delta(x_{t+j} \mid x^{t+j-1}_{t-i+1}, i + j)\, p_v$;

8.       $\alpha_{t+j} := \alpha_{t+j} + \alpha_t\, p_c\, (1 - \lambda(x^{t+j}_{t-i+1}))\, p_v$;

9.       $p_v := \lambda(x^{t+j}_{t-i+1})\, p_v$;

10.    $q := (1 - \lambda(x^t_{t-i+1}))\, q$;

11. return($\alpha_T$);

The $\alpha_t$ variable stores the total probability $p(x^t \mid \phi, t)$ for the substring $x^t$.

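As an illustration, the evaluation recurrence can be written in Python as follows. This is a sketch under our own assumptions rather than a transcription of the pseudocode: it indexes prefixes from the empty history (so `alpha[0] = 1`, following the generation procedure), stores $\lambda$ in a dictionary `lam` keyed by context tuples with missing entries treated as zero, and takes the conditional probabilities from a caller-supplied function `delta(sym, ctx)`.

```python
from typing import Callable, Dict, Sequence, Tuple

Context = Tuple[str, ...]


def nonuniform_evaluate(x: Sequence[str], n: int,
                        lam: Dict[Context, float],
                        delta: Callable[[str, Context], float]) -> float:
    """Total probability of x, summed over all generation paths."""
    T = len(x)
    alpha = [0.0] * (T + 1)       # alpha[t] = probability of generating the first t symbols
    alpha[0] = 1.0
    for t in range(T):
        skip = 1.0                # probability of rejecting every longer context
        for i in range(min(t, n - 1), -1, -1):
            w = 1.0 if i == 0 else lam.get(tuple(x[t - i:t]), 0.0)
            p_c = w * skip        # probability of choosing the context of length i
            j_max = min(T - t, n - i)
            p_v = 1.0
            for j in range(1, j_max + 1):
                # Predict symbol t + j from the context extended by the earlier predictions.
                p_v *= delta(x[t + j - 1], tuple(x[t - i:t + j - 1]))
                # Probability of extending further; forced to zero at j_max.
                ext = 0.0 if j == j_max else lam.get(tuple(x[t - i:t + j]), 0.0)
                alpha[t + j] += alpha[t] * p_c * (1.0 - ext) * p_v
                p_v *= ext
            skip *= 1.0 - w
    return alpha[T]
```

A convenient sanity check for such an implementation is that the probabilities of all strings of a fixed length $T$ sum to one whenever every $\delta(\cdot \mid c)$ is normalized.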
3.2 Decoding 

Decoding a string $x^T$ with respect to a nonuniform model $\phi$ is the process of
finding the single most likely generation path for that string. This
computation is performed in $O(n^2 T)$ time and $O(T)$ space by the following dynamic
programming algorithm.

NONUNIFORM-DECODE($x^T$, $\phi$)

1. For $t = 2$ to $T$ [ $\alpha_t := 0$; ]; $\alpha_1 := 1$;

2. For $t = 1$ to $T - 1$

3.   $q := 1$;

4.   for $i = \min(t, n - 1)$ down to $0$

5.     $p_c := \lambda(x^t_{t-i+1})\, q$; $p_v := 1$;

6.     for $j = 1$ to $\min(T - t, n - i)$

7.       $p_v := \delta(x_{t+j} \mid x^{t+j-1}_{t-i+1}, i + j)\, p_v$;

8.       if ($\alpha_t\, p_c\, (1 - \lambda(x^{t+j}_{t-i+1}))\, p_v > \alpha_{t+j}$) then [ $s_{t+j} := (i, j)$; $\alpha_{t+j} := \alpha_t\, p_c\, (1 - \lambda(x^{t+j}_{t-i+1}))\, p_v$; ]

9.       $p_v := \lambda(x^{t+j}_{t-i+1})\, p_v$;

10.    $q := (1 - \lambda(x^t_{t-i+1}))\, q$;

11. $s := \epsilon$; $t := T$;

12. while ($t > 1$) [ $s := s_t\, s$; $t := t - s_{t,1}$; ]

13. return($s$);

The $\alpha_t$ variable stores the probability of the most likely generation path for
$x^t$, while the $s_t$ variable stores the last transition in the most likely generation
path for $x^t$. Each transition in the nonuniform model is a pair $(i, j)$ indicating
that a context of length $i$ was used to make a prediction of length $j$. The
elements $i$ and $j$ of the pair $s_t = (i, j)$ are identified by the notation $s_{t,0}$ and
$s_{t,1}$, respectively. The most likely generation path is stored in the $s$ variable.

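Decoding differs from evaluation only in replacing the sum over paths by a maximum and recording back-pointers. The Python sketch below makes the same assumptions as a dictionary-based evaluator would (a `lam` dictionary keyed by context tuples, a caller-supplied `delta(sym, ctx)`, and prefixes indexed from the empty history); the function name and return format are ours.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Context = Tuple[str, ...]


def nonuniform_decode(x: Sequence[str], n: int,
                      lam: Dict[Context, float],
                      delta: Callable[[str, Context], float]
                      ) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the probability of the best generation path and the path itself,
    as a list of (context length, prediction length) pairs."""
    T = len(x)
    best = [0.0] * (T + 1)
    best[0] = 1.0
    back: List[Optional[Tuple[int, int]]] = [None] * (T + 1)
    for t in range(T):
        skip = 1.0
        for i in range(min(t, n - 1), -1, -1):
            w = 1.0 if i == 0 else lam.get(tuple(x[t - i:t]), 0.0)
            p_c = w * skip
            j_max = min(T - t, n - i)
            p_v = 1.0
            for j in range(1, j_max + 1):
                p_v *= delta(x[t + j - 1], tuple(x[t - i:t + j - 1]))
                ext = 0.0 if j == j_max else lam.get(tuple(x[t - i:t + j]), 0.0)
                score = best[t] * p_c * (1.0 - ext) * p_v
                if score > best[t + j]:          # keep only the best way to reach t + j
                    best[t + j] = score
                    back[t + j] = (i, j)
                p_v *= ext
            skip *= 1.0 - w
    if T > 0 and back[T] is None:                # the string has probability zero
        return 0.0, []
    path: List[Tuple[int, int]] = []
    t = T
    while t > 0:                                 # follow back-pointers to the start
        i, j = back[t]
        path.append((i, j))
        t -= j
    path.reverse()
    return best[T], path
```

For example, with $n = 2$ and a large $\lambda(a)$, the best path for the string "ab" is a single two symbol prediction from the empty context, written `[(0, 2)]` in this representation.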
3.3 Estimation



In this section, we formulate an expectation maximization (EM) algorithm for
the nonuniform Markov model. Our development follows the traditional lines
established for the hidden Markov model [?], [?]. (See [16] for a tutorial.) Recall
that we must first calculate the expected number of times that each hidden event
occurred for a given training sequence. The hidden events for the nonuniform
model are the choice of context and prediction lengths.

We begin by defining our forward and backward variables. The forward
variable $\alpha_t(i,j)$ contains the probability of generating the first $t$ symbols of the
history, picking a context of length $i$ and then making a prediction of length $j$,
according to the model $\phi$.

$$\alpha_t(i,j) = p(h = x^t, c = x^t_{t-i+1}, v = x^{t+j}_{t+1} \mid \phi, T) \qquad (8)$$

The following iterative algorithm calculates all $\alpha_t(i,j)$ values in $O(n^2 T)$ time
and $O(n^2 T)$ space.

FORWARD($x^T$, $\phi$)

1. For $j = 1$ to $n$ [ $\alpha_0(0,j) := p_v(x^j \mid \epsilon)$; ];

2. For $t = 1$ to $T$

3.   $\alpha_t := \sum_{j=1}^{\min(t,n)} \sum_{i=0}^{\min(t-j,n-j)} \alpha_{t-j}(i,j)$;

4.   For $i = 0$ to $\min(t, n - 1)$

5.     For $j = 1$ to $\min(T - t, n - i)$

6.       $\alpha_t(i,j) := \alpha_t\, p_c(i \mid x^t)\, p_v(x^{t+j}_{t+1} \mid x^t_{t-i+1})$;



The backward variable $\beta_t(i,j)$ contains the probability of generating the
final $T - t$ symbols in the string $x^T_{t+1}$, given that the history is $x^t$ and that we
have chosen to make a prediction of length $j$ in a context of length $i$ according
to the model $\phi$.

$$\beta_t(i,j) = p(x^T_{t+1} \mid h = x^t, c = x^t_{t-i+1}, v = x^{t+j}_{t+1}, \phi, T) = p(x^T_{t+j+1} \mid x^{t+j}, \phi) = \beta_{t+j} \qquad (9)$$

The following iterative algorithm calculates all $\beta_t$ values in $O(n^2 T)$ time and
$O(T)$ space. Note that we need only maintain a one dimensional table of $\beta$
values because $\beta_t(i,j) = \beta_{t+j}$ for all $i, j$.

BACKWARD($x^T$, $\phi$)

1. $\beta_T := 1$;

2. For $t = T - 1$ to $0$
The forward and backward variables allow us to calculate the a posteriori
probability of every hidden transition in our model, as represented by the following
$\gamma_t(i,j)$ variable.

$$\gamma_t(i,j) = p(c = x^t_{t-i+1}, v = x^{t+j}_{t+1} \mid x^T, \phi) = \alpha_t(i,j)\,\beta_t(i,j)/p(x^T \mid \phi) = \alpha_t(i,j)\,\beta_{t+j}/p(x^T \mid \phi) \qquad (10)$$

We use the following useful fact to verify our implementation of the $\gamma$
computation.

Theorem 2 The following constraint holds for the $\gamma$ values:

$$T = \sum_{t=0}^{T-1} \sum_{i=0}^{\min(t,n-1)} \sum_{j=1}^{\min(T-t,n-i)} j\,\gamma_t(i,j) \qquad (11)$$

Proof. Recall that $\gamma_t(i,j)$ represents the a posteriori probability that the
nonuniform model made a prediction of length $j$ using a context of length $i$ at time
$t$ in the input string $x^T$. Each such stochastic transition consumes exactly $j$
symbols of the input. Consequently, summing the $\gamma_t(i,j)$ over the prediction
lengths $j$ multiplied by the prediction lengths $j$ yields the expected number of
symbols predicted at time $t$ from a context of length $i$. Summing this quantity
over the context lengths $i$ yields the expected number of symbols predicted
at time $t$, independent of context length. Finally, summing this expectation
over all the times $t$ must yield the total number of symbols in a string $x^T$. □

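The constraint of theorem 2 makes a convenient unit test. The Python sketch below computes transition posteriors with a forward and a backward pass and lets the caller check that the $j$-weighted sum equals $T$. It rests on our own assumptions: prefixes are indexed from the empty history, $\lambda$ is a dictionary `lam` keyed by context tuples (missing entries count as zero), `delta(sym, ctx)` is caller-supplied, and the string is assumed to have nonzero probability.

```python
from typing import Callable, Dict, Sequence, Tuple

Context = Tuple[str, ...]


def transition_posteriors(x: Sequence[str], n: int,
                          lam: Dict[Context, float],
                          delta: Callable[[str, Context], float]):
    """Return ({(t, i, j): gamma}, p(x)) for a string x with p(x) > 0."""
    T = len(x)
    # trans[t][(i, j)] = p_c(i | x^t) * p_v(next j symbols | context of length i)
    trans = [dict() for _ in range(T)]
    for t in range(T):
        skip = 1.0
        for i in range(min(t, n - 1), -1, -1):
            w = 1.0 if i == 0 else lam.get(tuple(x[t - i:t]), 0.0)
            j_max = min(T - t, n - i)
            p_v = 1.0
            for j in range(1, j_max + 1):
                p_v *= delta(x[t + j - 1], tuple(x[t - i:t + j - 1]))
                ext = 0.0 if j == j_max else lam.get(tuple(x[t - i:t + j]), 0.0)
                trans[t][(i, j)] = w * skip * (1.0 - ext) * p_v
                p_v *= ext
            skip *= 1.0 - w
    fwd = [0.0] * (T + 1)                 # probability of the first t symbols
    fwd[0] = 1.0
    for t in range(T):
        for (i, j), p in trans[t].items():
            fwd[t + j] += fwd[t] * p
    bwd = [0.0] * (T + 1)                 # probability of the rest, given the first t
    bwd[T] = 1.0
    for t in range(T - 1, -1, -1):
        bwd[t] = sum(p * bwd[t + j] for (i, j), p in trans[t].items())
    total = fwd[T]
    gamma = {(t, i, j): fwd[t] * p * bwd[t + j] / total
             for t in range(T) for (i, j), p in trans[t].items()}
    return gamma, total
```

Summing `j * g` over all entries of the returned dictionary should give the length of the input string, up to rounding error.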
We sum the $\gamma$ values to obtain the expected number of times that the
nonuniform model transitioned from a longer context to a shorter one, or from a shorter
prediction to a longer one. We use two variables to keep track of our
expectations: $\lambda^+(y^i)$ accumulates the number of times that we used $y^i$ to condition our
prediction when it was possible to do so, while $\lambda^-(y^i)$ accumulates the
number of times that we could have used $y^i$ to condition our prediction but chose
a proper suffix instead. The following algorithm accumulates all $\lambda^+(y^i)$ and
$\lambda^-(y^i)$ values in $O(n^3 T)$ time and $O(n^2 T)$ space.

EXPECTATION-STEP($x^T$, $\phi$, $\lambda^+$, $\lambda^-$)

1. For $t = 1$ to $T$

2.   For $i = 0$ to $\min(t, n - 1)$

3.     For $j = 1$ to $\min(T - t, n - i)$

4.       $\lambda^+(x^t_{t-i+1})$ += $\gamma_t(i,j)$;

5.       $\lambda^-(x^{t+j}_{t-i+1})$ += $\gamma_t(i,j)$;

6.       For $l = i + 1$ to $\min(t, n - 1)$ [ $\lambda^-(x^t_{t-l+1})$ += $\gamma_t(i,j)$; ];

7.       For $l = j + 1$ to $\min(T - t, n - i)$ [ $\lambda^+(x^{t+j}_{t-i+1})$ += $\gamma_t(i,l)$; ];



Having done all the work in the expectation step, the maximization step is 
straightforward. 
MAXIMIZATION-STEP($\phi$, $\lambda^+$, $\lambda^-$)

1. For all strings $y^i$ in $A^{<n}$

2.   $\lambda(y^i) := \lambda^+(y^i)/(\lambda^+(y^i) + \lambda^-(y^i))$;

The following DELETED-ESTIMATION() algorithm estimates the parameters
of an interpolated model $\phi$ using a set $B$ of blocks of text. For each iteration,
we delete one block $B_i$ from the set $B$, initialize the string probabilities $\delta$ to
their empirical probabilities in the remaining blocks $B - B_i$ (line 4), and then
perform an expectation step on the deleted block $B_i$ (line 5). After all blocks
have been deleted, we update our model parameters (line 6).

DELETED-ESTIMATION($B$, $\phi$)

1. Until convergence

2.   Initialize $\lambda^+$, $\lambda^-$ to zero;

3.   For each block $B_i$ in $B$

4.     Initialize $\delta$ using $B - B_i$;

5.     EXPECTATION-STEP($B_i$, $\phi$, $\lambda^+$, $\lambda^-$);

6.   MAXIMIZATION-STEP($\phi$, $\lambda^+$, $\lambda^-$);

7. Initialize $\delta$ using $B$;
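The control flow of deleted estimation can be sketched in Python as below. This is only a skeleton under stated assumptions: `expectation_step` and `fit_delta` are hypothetical callables supplied by the caller (standing in for the expectation step and for the empirical estimation of $\delta$), the accumulators are `Counter` objects keyed by context tuples, and a fixed number of iterations replaces the convergence test.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Context = Tuple[str, ...]
Block = Sequence[str]


def deleted_estimation(blocks: List[Block],
                       lam: Dict[Context, float],
                       expectation_step: Callable,
                       fit_delta: Callable,
                       iterations: int = 10):
    """Re-estimate lam by deleted estimation; returns (lam, delta fitted on all blocks)."""
    for _ in range(iterations):                     # stand-in for "until convergence"
        plus, minus = Counter(), Counter()          # line 2
        for k, held_out in enumerate(blocks):       # line 3
            rest = blocks[:k] + blocks[k + 1:]
            delta = fit_delta(rest)                 # line 4: statistics without block k
            expectation_step(held_out, lam, delta, plus, minus)   # line 5
        for ctx in set(plus) | set(minus):          # line 6: maximization step
            denom = plus[ctx] + minus[ctx]
            if denom > 0:
                lam[ctx] = plus[ctx] / denom
    return lam, fit_delta(blocks)                   # line 7
```

Estimating $\delta$ on the retained blocks and accumulating expectations on the deleted block keeps the $\lambda$ estimates from simply trusting statistics that were computed on the same text.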



4 Experimental Results 

In this section we compare the performance of the interpolated context model
and the nonuniform model on the Wall Street Journal. (Recall that the
interpolated context model is the classic interpolated Markov model of Jelinek
and Mercer [?].) We performed two sets of experiments. The first set of
experiments was with the 6.2 million word WSJ 1989 corpus. The goal of these
initial experiments was to better understand how initial parameter values affect 
model performance. The second set of experiments was with the 42.3 million 
word WSJ 1987-89 corpus. In order to assess the possible value of our language 
models to speech recognition, we used verbalized punctuation and a vocabulary 
of approximately 20,000 words chosen from both training and test sets. Out- 
of-vocabulary words were mapped to a unique OOV symbol. In all cases, we 
used 90% of the corpus for training and 10% for testing. No parameter tying 
or parameter selection was performed. We report performance as test message 
perplexity. 
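The text does not spell out the formula for test message perplexity. A common definition, which we assume here purely for illustration, is the inverse geometric mean of the per-symbol probabilities over the whole test message. The function name and the choice of base 2 logarithms are ours.

```python
def message_perplexity(log2_prob: float, num_symbols: int) -> float:
    """Perplexity of a test message, given the base 2 log of its total
    probability and the number of symbols it contains."""
    if num_symbols <= 0:
        raise ValueError("the test message must contain at least one symbol")
    return 2.0 ** (-log2_prob / num_symbols)


# A model that assigns probability 1/2 to each of 4 symbols has perplexity 2.
print(message_perplexity(-4.0, 4))
```

Under this definition a lower perplexity means that the model assigned a higher probability to the test message.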

We set the $\delta$ parameters to be the empirical probabilities in the training
data and then optimized the $\lambda$ parameters on the training data using deleted
interpolation [?]. We soon discovered that the initial values for the $\lambda$
parameters had a noticeable effect on model performance as did the block size
used for deleted interpolation. Larger block sizes result in more conservative
estimates, which work better when the corpus is small relative to the alphabet
size and worse when the corpus is large relative to the alphabet size. More
aggressive initial estimates for the $\lambda$ parameters give better initial performance
for some model orders but worse ultimate performance. Regardless of how the $\lambda$
parameters were initialized or what block size was used, the nonuniform model
performed slightly better than the uniform model under equivalent experimental
conditions.

We considered three initial estimates for the $\lambda$ values: uniform, the
Jeffreys-Perks rule of succession [14], and the natural law of succession [17].
The uniform estimate sets all $\lambda$ values to 0.5. The Jeffreys-Perks rule
sets $\lambda(x^i)$ to $c(x^i)/(c(x^i) + k/2)$, for alphabet size $k$ and string
frequency $c(x^i)$. Jeffreys-Perks is a conservative estimate that assigns
relatively low probability to $\lambda(x^i)$. The natural law sets
$\lambda(x^i)$ to

\[
\frac{c(x^i)\,(c(x^i) + 1) + q(x^i)\,(1 - q(x^i))}{c(x^i)^2 + c(x^i) + 2\,q(x^i)}
\]

for string frequency $c(x^i)$ and context diversity
$q(x^i) = |\{y : c(x^i y) > 0\}|$. The natural law is an aggressive estimate
that assigns relatively high probability to $\lambda(x^i)$. The best
performance for higher model orders was achieved with uniform initialization
in all of our experiments.
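
To make the difference between the three initializations concrete, here is a small sketch that evaluates each rule for a single context. The function names are hypothetical, and returning 0 for an unseen context in the natural law (where the formula would otherwise be 0/0) is an assumption of this sketch rather than something stated in the text.

```python
def uniform_lambda():
    """Uniform initialization: every interpolation weight starts at 0.5."""
    return 0.5


def jeffreys_perks_lambda(c, k):
    """Jeffreys-Perks rule: c / (c + k/2) for string frequency c, alphabet size k."""
    return c / (c + k / 2)


def natural_law_lambda(c, q):
    """Natural law: (c(c+1) + q(1-q)) / (c^2 + c + 2q).

    c is the string frequency and q the context diversity, that is, the
    number of distinct symbols observed after the string.
    """
    if c == 0:
        return 0.0  # assumption: an unseen context gets no weight
    return (c * (c + 1) + q * (1 - q)) / (c * c + c + 2 * q)


# A context seen 100 times, followed by 10 distinct words, with a
# 20,000-word vocabulary.
print(jeffreys_perks_lambda(100, 20000))  # about 0.0099
print(natural_law_lambda(100, 10))        # about 0.989
```

With a vocabulary of this size, Jeffreys-Perks gives the higher-order statistics almost no weight unless the context is very frequent, while the natural law trusts them almost completely.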

4.1 WSJ 1989 

The first set of experiments was on the 1989 Wall Street Journal corpus, which 
contains 6,219,350 words. Our vocabulary consisted of the 20,293 words that 
occurred at least 10 times in the entire WSJ 1989 corpus. The goal of these 
initial experiments was to better understand how initial values affect model 
performance. 
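
The vocabulary construction described here can be sketched as a simple frequency cutoff followed by out-of-vocabulary mapping. The helper names and the `<OOV>` token string below are assumptions for illustration; the text only states that a unique OOV symbol was used.

```python
from collections import Counter

OOV = "<OOV>"  # hypothetical spelling of the unique out-of-vocabulary symbol


def build_vocabulary(tokens, min_count):
    """Keep every word that occurs at least min_count times in the corpus."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}


def map_oov(tokens, vocabulary, oov=OOV):
    """Replace every word outside the vocabulary with the OOV symbol."""
    return [word if word in vocabulary else oov for word in tokens]


corpus = ["the", "market", "rose", "the", "market", "fell"]
vocab = build_vocabulary(corpus, min_count=2)
print(map_oov(corpus, vocab))
```

For the WSJ 1989 corpus the cutoff would be `min_count=10`, applied to counts from the entire corpus.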

4.1.1 Before Optimization 

The following table reports test message perplexities for WSJ 1989 before the
$\lambda$ parameters were optimized using deleted interpolation. For n > 2, the
best results for both models are obtained when the $\lambda$ parameters are
initialized uniformly, and with that initialization the interpolated context
model performs better than the nonuniform model before optimization.





               Context Model                        Nonuniform Model
N    Jeffreys-Perks  Natural Law    0.5    Jeffreys-Perks  Natural Law    0.5
2        284.9          188.2      215.9       276.8          197.6      209.6
3        248.1          148.7      136.0       235.8          175.4      138.4
4        241.6          155.0      130.0       229.3          196.3      138.3
5        239.6          161.7      131.3       227.6          211.4      142.6
6        238.7          165.7      132.6       226.9          219.4      145.2

4.1.2 After Optimization

The following table reports test message perplexities for WSJ 1989 after
optimization via deleted interpolation. All models were trained using deleted
interpolation with 22 blocks on the first 90% of the corpus and then tested on
the remaining 10% of the corpus. The nonuniform model slightly outperforms
the context model for n > 3. The best results for both models are obtained
when the $\lambda$ parameters are initialized uniformly. The nonuniform model
is less sensitive to the initial $\lambda$ estimates than the context model.





               Context Model                        Nonuniform Model
N    Jeffreys-Perks  Natural Law    0.5    Jeffreys-Perks  Natural Law    0.5
2        175.3          175.2      175.2       177.7          177.6      177.7
3        122.1          121.8      121.2       121.6          121.6      121.2
4        115.8          115.9      114.0       113.6          114.1      113.2
5        114.5          115.4      112.6       111.9          113.0      111.4
6        114.1          115.6      112.3       111.5          112.9      111.0



4.2 WSJ 1987-89 

The second set of experiments was on the 1987-89 Wall Street Journal corpus, 
which contains 42,373,513 words. Our vocabulary consisted of the 20,092 words 
that occurred at least 63 times in the entire WSJ 1987-89 corpus. The goal of 
these experiments was to produce competitive results for the context model,
in order to compare them with those achieved by the nonuniform model.
We believe that we are the first to report WSJ 1987-89 results for full (i.e.,
unpruned) interpolated Markov models of higher order than trigrams.

4.2.1 Before Optimization 

The following table reports test message perplexities for WSJ 1987-89 before
optimization via deleted interpolation. All $\lambda$ values were initialized
uniformly.



N    Context Model    Nonuniform Model
2        198.2              190.1
3        107.5              106.1
4         97.7              100.4



4.2.2 After Optimization 

The following table reports test message perplexities for WSJ 1987-89 after
optimization via deleted interpolation. All $\lambda$ values were initialized
uniformly. The models were trained using deleted interpolation with 152 blocks
on the first 90% of the corpus and then tested on the remaining 10% of the
corpus. The nonuniform model performs slightly better than the context model
for n > 2.

N    Context Model    Nonuniform Model
2        150.7              151.7
3         93.4               93.3
4         85.7               84.4



5 Conclusion 

We have proposed a nonuniform Markov model that makes predictions of varying
lengths using contexts of varying lengths. We argue that the nonuniform
model combines the ability of the context model to properly model situations
of local independence with the ability of the state model to properly model
situations of global independence. We demonstrated that the nonuniform model
slightly outperforms the interpolated context model on natural language text.
This result is somewhat remarkable when we consider that both models are
based on the statistics of fixed-length strings, and that both models contain
identical numbers of parameters whose values are estimated using
expectation-maximization. The only difference between the two models is how
they combine the statistics of longer and shorter strings.